Abstract. We provide a new framework for generating multiple good quality 
partitions (clusterings) of a single data set. Our approach decomposes this problem 
into two components, generating many high-quality partitions, and then grouping 
these partitions to obtain k representatives. The decomposition makes the approach 
extremely modular and allows us to optimize various criteria that control the choice 
of representative partitions. 

1 Introduction 

Clustering is a critical tool used to understand the structure of a data set. There 
are many ways in which one might partition a data set into representative clus- 
ters, and this is demonstrated by the huge variety of different algorithms for 
clustering H9I35I8I27I45I3 11431 141281 . 

Each clustering method identifies different kinds of structure in data, reflect- 
ing different desires of the end user. Thus, a key exploratory tool is identifying 
a diverse and meaningful collection of partitions of a data set, in the hope that 
these distinct partitions will yield different insights about the underlying data. 

Problem specification. The input to our problem is a single data set X. The 
output is a set of k partitions of X. A partition of X is a set of subsets 
$X_i = \{X_{i,1}, X_{i,2}, \ldots, X_{i,s}\}$ where $X = \bigcup_{j} X_{i,j}$ and 
$X_{i,j} \cap X_{i,j'} = \emptyset$ for all $j \neq j'$. Let $\mathcal{P}_X$ 
be the space of all partitions of X; since X is fixed throughout this paper, we just 
refer to this space as $\mathcal{P}$. 

There are two quantities that control the nature of the partitions generated. 
The quality of a partition, represented by a function $Q : \mathcal{P} \to \mathbb{R}^+$, measures 
the degree to which a particular partition captures intrinsic structure in data; in 
general, most clustering algorithms that identify a single clustering attempt to 
optimize some notion of quality. The distance between partitions, represented by 
the function $d : \mathcal{P} \times \mathcal{P} \to \mathbb{R}$, is a quantity measuring how dissimilar two partitions 
are. The partitions $X_i \in \mathcal{P}$ that do a better job of capturing the structure of the 
data set X will have a larger quality value $Q(X_i)$, and the partitions $X_i, X_{i'} \in \mathcal{P}$ 
that are more similar to each other will have a smaller distance value $d(X_i, X_{i'})$. 
A good set of diverse partitions all have large distances from each other and all 
have high quality scores. 
Thus, the goal of this paper is to generate a set of k partitions that represent 
all high-quality partitions as accurately as possible. 

Related Work. There are two main approaches in the literature for computing 
many high-quality, diverse partitions. However, both approaches focus only on a 
specific subproblem. Alternate clustering focuses on generating one additional 
partition of high quality that should be far from a given set (typically of size 
one) of existing partitions. k-consensus clustering assumes an input set of many 
partitions, and then seeks to return k representative partitions. 

Most algorithms for generating alternate partitions [38, 16, 6, 5, 13, 21, 12] operate 
as follows. Generate a single partition using a clustering algorithm of choice. 
Next, find another partition that is both far from the first partition and of high 
quality. Most methods stop here, but a few methods try to discover more alternate 
partitions; they repeatedly find new, still high-quality, partitions that are far from 
all existing partitions. This effectively produces a variety of partitions, but the 
quality of each successive partition degrades quickly. 

Although there are a few other methods that try to discover alternate partitions 
simultaneously [10, 29, 37], they are usually limited to discovering two partitions 
of the data. Other methods that generate more than just two partitions either 
randomly weigh the features or project the data onto different subspaces, but use 
the same clustering technique to get the alternate partitions in each round. Using 
the same clustering technique tends to generate partitions with clusters of similar 
shapes and might not be able to exploit all the structure in the data. 

The problem of k-consensus, which takes as input a set of $m \gg k$ partitions 
of a single data set to produce k distinct partitions, has not been studied as 
extensively. To obtain the input for this approach, either the output of several distinct 
clustering algorithms, or the output of multiple runs of the same randomized 
algorithm with different initial seeds, is considered [46, 47]. This problem can 
then be viewed as a clustering problem; that is, finding k clusters of partitions 
from the set of input partitions. Therefore, there are many possible optimization 
criteria or algorithms that could be explored for this problem as there are for 
clustering in general. Most formal optimization problems are intractable to solve 
exactly, making heuristics the only option. Furthermore, no matter the technique, 
the solution is only as good as the input set of partitions, independent of the 
optimization objective. In most k-consensus approaches, the set of input partitions is 
usually not diverse enough to give a good solution. 

In both cases, these subproblems avoid the full objective of constructing a 
diverse set of partitions that represent the landscape of all high-quality partitions. 
The alternate clustering approach is often too reliant on the initial partition, and has 
had only limited success in generalizing the initial step to generate k partitions. 
The k-consensus partitioning approach does not verify that its input represents 
the space of all high-quality partitions, so a representative set of those input 
partitions is not necessarily a representative set of all high-quality partitions. 



Our approach. To generate multiple good partitions, we present a new paradigm 
which decouples the notion of distance between partitions and the quality of 
partitions. Prior methods that generate multiple diverse partitions cannot explore 
the space of partitions entirely since the distance component in their objective 
functions biases against partitions close to the previously generated ones. These 
could be interesting partitions that might now be left out. To avoid this, we 
will first look at the space of all partitions more thoroughly and then pick non- 
redundant partitions from this set. Let k be the number of diverse partitions that 
we seek. Our approach works in two steps. In the first step, called the generation 
step, we sample from the space of all partitions proportional to their quality. 
The Stirling number of the second kind, S(n,s), is the number of ways of partitioning 
a set of n elements into s nonempty subsets; this is therefore the size of the space 
that we sample from. We illustrate the sampling in Figure 1. This generates a set 
of size $m \gg k$ to ensure we get a diverse sample that represents the space of all 
partitions well, since generating only k partitions in this phase may "accidentally" 
miss some high quality region of $\mathcal{P}$. Next, in the grouping step, we cluster this 
set of m partitions into k sets, resulting in k clusters of partitions. We then return 
one representative from each of these k clusters as our output alternate partitions. 
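
To give a sense of how large the sampled space is, the following sketch computes S(n,s) with the standard recurrence S(n,s) = s S(n-1,s) + S(n-1,s-1). The function name `stirling2` is ours and does not appear in the paper; it is meant only as an illustration.

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def stirling2(n: int, s: int) -> int:
    """Number of ways to partition n labelled elements into s nonempty subsets."""
    if n == s:
        return 1  # covers S(0, 0) = 1 and the all-singletons partition
    if s == 0 or s > n:
        return 0
    # Element n either joins one of the s existing blocks or forms a new block.
    return s * stirling2(n - 1, s) + stirling2(n - 1, s - 1)
```

For a data set of 100 points and s = 5, `stirling2(100, 5)` is roughly 5^100 / 5!, on the order of 10^67, so enumerating the space is infeasible and sampling is the natural alternative.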



[Figure 1: schematic of the space of all partitions; ■ marks samples generated 
proportional to quality.] 

Fig. 1. Sampling partitions proportional to their quality from the space of all partitions with s 
clusters. 

Note that because the generation step is decoupled from the grouping step, 
we treat all partitions fairly, independent of how far they are from the existing 
partitions. This allows us to explore the true density of high quality partitions 
in $\mathcal{P}$ without interference from the choice of initial partition. Thus, if there is a 
dense set of close interesting partitions, our approach will recognize that. Also, 
because the grouping step is run separately from the generation step, we can 
abstract this problem to a generic clustering problem, and we can choose one of 
many approaches. This allows us to capture different properties of the diversity 
of partitions, for instance, either guided just by the spatial distance between 
partitions, or also by a density-based distance which only takes into account the 
number of high-quality partitions assigned to a cluster. 

From our experimental evaluation, we note that decoupling the generation 
step from the grouping step helps, as we are able to generate many very high 
quality partitions. In fact, the quality of some of the generated partitions is better 
than the quality of the partition obtained by a consensus clustering technique 
called LiftSSD [39]. The relative quality w.r.t. the reference partition of a few 
generated partitions even reaches close to 1. To the best of our knowledge, such partitions 
have not been uncovered by previous meta-clustering techniques. The 
grouping step also picks out representative partitions that are far away from each other. 
We observe this by computing the closest-pair distance between representatives 
and comparing it against the distance values of the partitions to their closest 
representative. 

Outline. In Section 2 we discuss a sampling-based approach for generating 
many partitions proportional to their quality; i.e., the higher the quality of a 
partition, the more likely it is to be sampled. In Section 3 we describe how to 
choose k representative partitions from the large collection of partitions already 
generated. We present the results of our approach in Section 4. We have 
tested our algorithms on a synthetic dataset, a standard clustering dataset from 
the UCI repository, and a subset of images from the Yale Face Database B. 

2 Generating Many High Quality Partitions 

In this section we describe how to generate many high quality partitions. This 
requires (1) a measure of quality, and (2) an algorithm that generates a partition 
with probability proportional to its quality. 

2.1 Quality of Partitions 

Most work on clustering validity criteria looks at a combination of how compact 
clusters are and how separated two clusters are. Some of the popular measures 
that follow this theme are S_Dbw, CDbw, SD validity index, maximum likelihood 
and Dunn index 1123124136144115141321171 . Ackerman et. al. also discuss similar 
notions of quality, namely VR (variance ratio) and WPR (worst pair ratio) in their 
study of clusterability [12]. We briefly describe a few specific notions of quality 
below. 

k-Means quality. If the elements $x \in X$ belong to a metric space with an 
underlying distance $\delta : X \times X \to \mathbb{R}$ and each cluster $X_{i,j}$ in a partition $X_i$ is 
represented by a single element $x_j$, then we can measure the inverse quality of a 
cluster by $q(X_{i,j}) = \sum_{x \in X_{i,j}} \delta(x, x_j)^2$. The quality of the entire partition is 
then the inverse of the sum of the inverse qualities of the individual clusters: 

$$Q(X_i) = 1 \Big/ \sum_{j=1}^{s} q(X_{i,j}).$$ 



This corresponds to the quality optimized by s-means clustering¹ and is quite 
popular, but is susceptible to outliers. If all but one element of X fit neatly in s 
clusters, but the one remaining point is far away, then this one point dominates the 
cost of the clustering, even if it is effectively noise. Specifically, the quality score 
of this measure is dominated by the points which fit least well in the clusters, as 
opposed to the points which are best representative of the true data. Hence, this 
quality measure may not paint an accurate picture about the partition. 
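
As a concrete illustration, the sketch below evaluates this quality for points in Euclidean space. It assumes that the distance is Euclidean and that each cluster is represented by its centroid; the name `s_means_quality` and the label-list encoding of a partition are our own choices, not the paper's.

```python
from math import dist


def s_means_quality(points, labels, s):
    """Inverse of the summed squared distances of points to their cluster centroid."""
    dim = len(points[0])
    total = 0.0
    for j in range(s):
        members = [p for p, lab in zip(points, labels) if lab == j]
        if not members:
            continue  # an empty cluster contributes no cost
        centroid = [sum(p[d] for p in members) / len(members) for d in range(dim)]
        total += sum(dist(p, centroid) ** 2 for p in members)
    # A partition with zero cost (every point on its centroid) gets infinite quality.
    return float("inf") if total == 0.0 else 1.0 / total
```

Because every term is a squared distance, a single far-away point can contribute most of `total`, which is the outlier sensitivity described above.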

Kernel distance quality. We introduce a method to compute the quality of a partition 
based on the kernel distance [30]. Here we start with a similarity function 
between two elements of X, typically in the form of a (positive definite) kernel 
$K : X \times X \to \mathbb{R}^+$. If $x_1, x_2 \in X$ are more similar, then $K(x_1, x_2)$ is larger than 
if they are less similar. The overall similarity score between two clusters 
$X_{i,j}, X_{i,j'} \in X_i$ is defined as $K(X_{i,j}, X_{i,j'}) = \sum_{x \in X_{i,j}} \sum_{x' \in X_{i,j'}} K(x, x')$, and a single 
cluster's self-similarity for $X_{i,j} \in X_i$ is defined as $K(X_{i,j}, X_{i,j})$. Finally, the overall quality 
of a partition is defined as $Q_K(X_i) = \sum_{j=1}^{s} K(X_{i,j}, X_{i,j})$. 

If X is a metric space, the highest quality partitions divide X into s Voronoi 
cells around s points - similar to s-means clustering. However, its score is 
dominated by the points which are a good fit to a cluster, rather than outlier 
points which do not fit well in any cluster. This is a consequence of how kernels 
like the Gaussian kernel taper off with distance, and is the reason we recommend 
this measure of cluster quality in our experiments. 
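
A minimal sketch of this quality measure follows. It assumes a Gaussian kernel with a bandwidth parameter `sigma`, which we introduce only for illustration, and encodes a partition as a list of cluster labels; neither choice is prescribed by the text above.

```python
from math import dist, exp


def gaussian_kernel(x, y, sigma=1.0):
    """Similarity that equals 1 for identical points and decays with distance."""
    return exp(-dist(x, y) ** 2 / (2.0 * sigma ** 2))


def kernel_quality(points, labels, s, sigma=1.0):
    """Sum over clusters of the within-cluster kernel self-similarity."""
    total = 0.0
    for j in range(s):
        members = [p for p, lab in zip(points, labels) if lab == j]
        # Self-similarity: kernel summed over all ordered pairs in the cluster.
        total += sum(gaussian_kernel(x, y, sigma) for x in members for y in members)
    return total
```

A point far from the rest of its cluster adds almost nothing beyond its own self-term, since the kernel value to every other member is close to zero. This is why well-fitting points dominate the score.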

2.2 Generation of Partitions Proportional to Quality 

We now discuss how to generate a sample of partitions proportional to their 
quality. This procedure will be independent of the measure of quality used, so 
we will generically let $Q(X_i)$ denote the quality of a partition. Now the problem 
becomes to generate a set $Y \subset \mathcal{P}$ of partitions where each $X_i \in Y$ is drawn 
randomly proportionally to $Q(X_i)$. 

The standard tool for this problem framework is a Metropolis-Hastings 
random-walk sampling procedure [34, 25, 26]. Given a domain X to be sampled 
and an energy function $Q : X \to \mathbb{R}$, we start with a point $x \in X$, and suggest a 
new point $x_1$ that is typically "near" x. The point $x_1$ is accepted unconditionally 
if $Q(x_1) \geq Q(x)$, and is accepted with probability $Q(x_1)/Q(x)$ if not. Otherwise, 
we say that $x_1$ was rejected and instead set $x_1 = x$ as the current state. After some 
sufficiently large number of such steps t, the state $x_t$ is expected to be a random draw 
from $\mathcal{P}$ with probability proportional to Q. To generate many random samples 
from $\mathcal{P}$, this procedure is repeated many times. 
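
The acceptance rule can be sketched generically as follows. This is an illustration under two assumptions that we add: the proposal is symmetric (proposing y from x is as likely as proposing x from y) and the quality function is strictly positive. The names `metropolis_hastings`, `q` and `propose` are ours.

```python
import random


def metropolis_hastings(q, propose, x0, steps, rng=None):
    """Run a random walk from x0 and return the state after `steps` proposals."""
    rng = rng or random.Random()
    x = x0
    for _ in range(steps):
        x_new = propose(x, rng)
        # Accept if quality does not drop, otherwise with probability q(x_new) / q(x).
        if q(x_new) >= q(x) or rng.random() < q(x_new) / q(x):
            x = x_new
        # On rejection the current state x is kept unchanged.
    return x
```

For instance, on three states arranged in a cycle with qualities 1, 2 and 7, and a proposal that moves one step left or right with equal probability, repeated runs end in the third state about 70% of the time, matching its share of the total quality.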

In general, Metropolis-Hastings sampling suffers from high autocorrelation, 
where consecutive samples are too close to each other. This can happen when 
far away samples are rejected with high probability. To counteract this problem, 
often Gibbs sampling is used [41]. Here, each proposed step is decomposed into 
several orthogonal suggested steps and each is individually accepted or rejected in 
order. This effectively constructs one longer step with a much higher probability 
of acceptance since each individual step is accepted or rejected independently. 
Furthermore, if each step is randomly made proportional to Q, then we can 
always accept the suggested step, which reduces the rejection rate. 

¹ It is commonplace to use k in place of s, but we reserve k for other notions in this paper. 

Metropolis-Hastings-Gibbs sampling for partitions. The Metropolis-Hastings 
procedure for partitions works as follows. Given a partition $X_i$, we wish to select 
a random subset $Y \subset X$ and randomly reassign the elements of Y to different 
clusters. If the size of Y is large, this will have a high probability of rejection, 
but if Y is small, then consecutive partitions will be very similar. Thus, we use 
a Gibbs-sampling approach. At each step we choose a random ordering $\sigma$ of 
the elements of X. Now, we start with the current partition $X_i$ and choose the 
first element $x_{\sigma(1)} \in X$. We assign $x_{\sigma(1)}$ to each of the s clusters in turn, generating s 
suggested partitions $X_i^j$, and calculate s quality scores $q_j = Q(X_i^j)$. Finally, we 
select index j with probability proportional to $q_j$, and assign $x_{\sigma(1)}$ to cluster j. Rename the new 
partition as $X_i$. We repeat this for all points in order. Finally, after all elements 
have been reassigned, we set $X_{i+1}$ to be the resulting partition. 
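
One sweep of this procedure might look as follows. This is only a sketch: a partition is encoded as a list of cluster labels, `quality` is any strictly positive function of such a list, and the name `gibbs_sweep` is hypothetical. The sketch does not prevent a cluster from becoming empty, which a full implementation may want to handle.

```python
import random


def gibbs_sweep(labels, s, quality, rng=None):
    """Reassign every element once, in random order, proportionally to quality."""
    rng = rng or random.Random()
    labels = list(labels)  # work on a copy so the caller's partition is untouched
    order = list(range(len(labels)))
    rng.shuffle(order)  # random ordering sigma of the elements
    for idx in order:
        scores = []
        for j in range(s):
            labels[idx] = j  # suggested partition with element idx in cluster j
            scores.append(quality(labels))
        # Draw the new cluster of element idx with probability q_j / sum(q).
        labels[idx] = rng.choices(range(s), weights=scores, k=1)[0]
    return labels
```

Each sweep evaluates the quality function s times per element, so an incremental update of the quality score would be the first optimization to consider for larger data sets.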

Note that auto-correlation effects may still occur since we tend to have 
partitions with high quality, but this effect will be much reduced. Note that we 
do not have to run this entire procedure each time we need a new random sample. 
It is common in practice to run this procedure for some number $t_0$ (typically 
$t_0 = 1000$) of burn-in steps, and then use the next m steps as m random samples 
from $\mathcal{P}$. The rationale is that after the burn-in period, the induced Markov chain 
is expected to have mixed, and so each new step would yield a random sample 
from the stationary distribution. 
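
The burn-in scheme can be sketched as a small driver. Here `step` is any function that maps the current state of the chain to the next one, and `sample_chain` is a hypothetical name; the default values are merely illustrative.

```python
def sample_chain(step, x0, burn_in=1000, m=4000):
    """Discard the first `burn_in` states of the chain, then keep the next `m`."""
    x = x0
    for _ in range(burn_in):
        x = step(x)  # burn-in states are thrown away
    samples = []
    for _ in range(m):
        x = step(x)
        samples.append(x)
    return samples
```

Consecutive kept states come from the same chain and are therefore still correlated; keeping only every few states is a common variation when more independence is needed.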

3 Grouping the Partitions 

Having generated a large collection Z of $m \gg k$ high-quality partitions from 
$\mathcal{P}$ by random sampling, we now describe a grouping procedure that returns k 
representative partitions from this collection. We will start by placing a metric 
structure on $\mathcal{P}$. This allows us to view the problem of grouping as a metric 
clustering problem. Our approach is independent of any particular choice of 
metric; obviously, the specific choice of distance metric and clustering algorithm 
will affect the properties of the output set we generate. There are many different 
approaches to comparing partitions. While our approach is independent of the 
particular choice of distance measure used, we review the main classes. 

Membership-based distances. The most commonly used class of distances 
is membership-based. These distances compute statistics about the number of 
pairs of points which are placed in the same or different cluster in both parti- 
tions, and return a distance based on these statistics. Common examples include 
the Rand distance, the variation of information, and the normalized mutual 
information [33, 40, 42, 7]. While these distances are quite popular, they ignore 
information about the spatial distribution of points within clusters, and so are 
unable to differentiate between partitions that might be significantly different. 
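
As an example of a membership-based distance, the sketch below computes the Rand distance between two partitions given as label lists over the same elements. The function name `rand_distance` is ours, and the sketch assumes at least two elements.

```python
from itertools import combinations


def rand_distance(labels_a, labels_b):
    """Fraction of element pairs on which the two partitions disagree."""
    n = len(labels_a)
    disagreements = 0
    for i, j in combinations(range(n), 2):
        same_a = labels_a[i] == labels_a[j]
        same_b = labels_b[i] == labels_b[j]
        # A pair disagrees if it is together in one partition and apart in the other.
        disagreements += same_a != same_b
    return disagreements / (n * (n - 1) / 2)
```

The function never looks at the coordinates of the points, only at co-membership, which makes the lack of spatial information explicit.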

Spatially-sensitive distances. In order to rectify this problem, a number of 
spatially-aware measures have been proposed. In general, they work by comput- 
ing a concise representation of each cluster and then use the earthmover's distance 
(EMD) [20] to compare these sets of representatives in a spatially-aware manner. 
These include CDistanceCIl, ^ADCO^' CC distance E§1, and LiftEMDjSD. 
As discussed in [39], LiftEMD has the benefit of being both efficient and a 
well-founded metric, and is the method used here. 

Density-based distances. The partitions we consider are generated via a sam- 
pling process that samples more densely in high-quality regions of the space of 
partitions. In order to take into account dense samples in a small region, we use 
a density-sensitive distance that intuitively spreads out regions of high density. 
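Consider two partitions $X_i$ and $X_{i'}$. Let $d : \mathcal{P} \times \mathcal{P} \to \mathbb{R}^+$ be any of the above 
natural distances on $\mathcal{P}$. Then let $d_Z : \mathcal{P} \times \mathcal{P} \to \mathbb{R}^+$ be a density-based distance 
defined as $d_Z(X_i, X_{i'}) = |\{X_l \in Z \mid d(X_i, X_l) < d(X_i, X_{i'})\}|$. 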
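
Read literally, the density-based distance counts the sampled partitions that lie strictly closer to the first partition than the second one does. A minimal sketch under that reading, with a hypothetical function name and an arbitrary base distance `d` passed in by the caller:

```python
def density_distance(x_i, x_j, sample, d):
    """Number of sampled items strictly closer to x_i than x_j is."""
    radius = d(x_i, x_j)
    return sum(1 for x in sample if d(x_i, x) < radius)
```

In the test-style example of numbers on a line with `d(a, b) = abs(a - b)` and sample `[0, 1, 2, 3, 10]`, the value from 0 to 3 is 3, since 0, 1 and 2 are closer to 0 than 3 is. Under this reading the count need not be symmetric in its two arguments, so a symmetrized variant may be preferable if a clustering algorithm requires it.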

3.1 Clusters of Partitions 

Once we have specified a distance measure to compare partitions, we can cluster 
them. We will use the notation $\phi(X_i)$ to denote the representative partition $X_i$ 
is assigned to. We would like to pick k representative partitions, and a simple 
algorithm by Gonzalez [22] provides a 2-approximation to the best clustering 
that minimizes the maximum distance between a point and its assigned center. 
The algorithm maintains a set C of $k' < k$ centers. Let $\phi_C(X_i)$ represent the 
partition in C closest to $X_i$ (when apparent we use just $\phi(X_i)$ in place of $\phi_C(X_i)$). 
The algorithm chooses $X_i \in Z$ with maximum value $d(X_i, \phi(X_i))$. It adds this 
partition $X_i$ to C and repeats until C contains k partitions. We run the Gonzalez 
method to compute k representative partitions using LiftEMD between partitions. 
We also ran the method using the density-based distance derived from 
LiftEMD. We got very similar results in both cases and only report the 
results from using LiftEMD in Section 4. We note that other clustering methods 
such as k-means and hierarchical agglomerative clustering yield similar results. 
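
The farthest-first selection described above can be sketched as follows. The distance `d` is passed in as a function, so any of the measures from this section could be plugged in. The choice of the first center (index 0 by default) is our assumption, and the name `gonzalez` is only a label for this sketch.

```python
def gonzalez(items, k, d, first=0):
    """Farthest-first traversal: indices of k centers and each item's nearest center."""
    centers = [first]
    # dist_to_center[i] holds d(items[i], phi(items[i])) for the current center set.
    dist_to_center = [d(items[first], x) for x in items]
    assigned = [first] * len(items)
    while len(centers) < min(k, len(items)):
        # The next center is the item farthest from its current representative.
        far = max(range(len(items)), key=dist_to_center.__getitem__)
        centers.append(far)
        for i, x in enumerate(items):
            dx = d(items[far], x)
            if dx < dist_to_center[i]:
                dist_to_center[i] = dx
                assigned[i] = far
    return centers, assigned
```

Each round costs one pass over the items, so selecting k representatives from m sampled partitions needs on the order of k·m distance evaluations.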

4 Experimental Evaluation 

In this section, we show the effectiveness of our technique in generating partitions 
of good divergence and its power to find partitions with very high quality, well 
beyond usual consensus techniques. 

Data. We created a synthetic dataset 2D5C with 100 points in 2-dimensions, for 
which the data is drawn from 5 Gaussians to produce 5 visibly separate clusters. 
We also test our methods on the Iris dataset containing 150 points in 4 dimensions 
from the UCI machine learning repository [18]. We also use a subset of the Yale 
Face Database B [19] (90 images corresponding to 10 persons and 9 poses in the 
same illumination). The images are scaled down to 30x40 pixels. 



Methodology. For each dataset, we first run k-means to get the first partition 
with the same number of clusters specified by the reference partition. Using this 
as a seed, we generate m = 4000 partitions after throwing away the first 1000 
of them. We then run the Gonzalez k-center method to find 10 representative 
partitions. We associate each of the 3990 remaining partitions with the closest 
representative partition. We compute and report the quality of each of these 
representative partitions. We also measure the LiftEMD distance to each of these 
partitions from the reference partition. For comparison, we also plot the quality 
of consensus partitions generated by LiftSSD [39] using inputs from k-means, 
single-linkage, average-linkage, complete-linkage and Ward's method. 

4.1 Performance Evaluation 

Evaluating partition diversity. We can evaluate partition diversity by determining 
how close partitions are to their chosen representatives using LiftEMD. Low 
LiftEMD values between partitions will indicate redundancy in the generated 
partitions and high LiftEMD values will indicate good partition diversity. The 
resulting distribution of distances is presented in Figures 2(a), 2(b), and 2(c), in 
which we also mark the distance values between a representative and its closest 
other representative with red squares. Since we expect that the representative 
partitions will be far from each other, those distances provide a baseline for 
distances considered large. For all datasets, a majority of the partitions generated 
are generally far from the closest representative partition. For instance, in the Iris 
data set (Figure 2(a)), about three-fourths of the partitions generated are far away from 
the closest representative, with LiftEMD values ranging between 1.3 and 1.4. 

Evaluating partition quality. Secondly, we would like to inspect the quality of 
the partitions generated. Since we intend the generation process to sample from 
the space of all partitions proportional to the quality, we hope for a majority of 
the partitions to be of high quality. The ratio of the kernel distance quality 
$Q_K$ of a partition to that of the reference partition gives us a fair idea of the 
relative quality of that partition, with values closer to 1 indicating partitions of 
higher quality. The distribution of quality is plotted in Figures 3(a), 3(b), and 3(c). We 
observe that for all the datasets, we get a normally distributed quality distribution 
with a mean value between 0.62 and 0.8. In addition, we compare the quality of 
our generated partitions against the consensus technique LiftSSD. We mark the 
quality of the representative partitions with red squares and that of the consensus 
partition with a blue circle. For instance, chart 3(a) shows that the relative quality 
w.r.t. the reference partition of three-fourths of the partitions is better than that of 
the consensus partition. For the Yale Face data, note that we have two reference 
partitions, namely by pose and by person, and we chose the partition by person as 
the reference partition due to its superior quality. 



Visual inspection of partitions. We ran multi-dimensional scaling [3] on the 
all-pairs distances between the 10 representatives for a visual representation of 
the space of partitions. We compute the variance of the distances of the partitions 
associated with each representative and draw Gaussians around them to depict 
the size of each cluster of partitions. For example, for the Iris dataset, as we 
can see from chart 4(a), the clusters of partitions are well-separated and are 
far from the original reference partition. In Figure 5 we show two interesting 
representative partitions on the Yale face database. We show the mean image 
from each of the 10 clusters. Figure 5(a) is a representative partition very similar 
to the partition by person and Figure 5(b) resembles the partition by pose. 



5 Conclusion 

In this paper we introduced a new framework to generate multiple non-redundant 
partitions of good quality. Our approach is a two-stage process: in the generation 
step, we focus on sampling a large number of partitions from the space of all 
partitions proportional to their quality, and in the grouping step, we identify k 
representative partitions that best summarize the space of all partitions. 
Fig. 2. Distance between partition and its representative. 
Fig. 3. Quality of Partitions. 